(54) Computer architecture containing processor and coprocessor



(57) A computer system comprises a first processor 1 and a second processor 2 for use as a coprocessor to
the first processor 1. The system has a main memory 3. The system also has a decoupling element 8 such that
instructions are passed to the second processor 2 from the first processor 1 through the decoupling element 8.
This has the effects that the second processor 2 consumes instructions derived from the first processor 1
through the decoupling element 8, and that the second processor 2 receives data from and writes data to the
memory 3. The processing of instructions by the second processor 2 can thus be decoupled from the operation
of the first processor 1.

This is particularly effective for processing of a computationally intensive task (such as a media computation)
on an architecture with a general purpose first processor 1, using a second processor 2 adapted for the
computationally intensive task. This can effectively be combined with use of a buffer memory 5 adapted to
exchange data particularly rapidly with the memory 3 in response to memory instructions, together with a further
decoupling element 6 to decouple the buffer memory 5 from the first processor 1.
Description

FIELD OF INVENTION 



[0001] The invention relates to computer architectures involving a main processor and a coprocessor. 
DESCRIPTION OF PRIOR ART

suited to a given computational task such a FPGA^ p be programmed with a configuration particularly 

SUMMARY OF INVENTION

[0005] Accordingly, there is provided a computer system, comprising: a first processor; a second processor for a .

[0006] This arrangement can produce considerable improvements in performance, as the first processor, typically a
general purpose microprocessor, can switch tasks while execution of the instructions is carried out on the second
processor, typically a processor specially adapted to carry out the computation delegated to it.
This is very important when the first processor is the central processing unit of a computer system.

The task relating to the computation that may be left to the first processor is servicing of the decoupling
element (so that it can provide instructions effectively). Advantageously, the decoupling element may be set up such that
it will require no such servicing during performance of the delegated task.

[0008] One possible choice of decoupling element is a coprocessor instruction queue, wherein instructions are added
to the instruction queue by the first processor and taken from the coprocessor instruction queue by the second processor.

[0009] An alternative choice is a state machine, wherein information to provide instructions is provided to the state
machine by the first processor, and instructions are provided in an ordered sequence to the second processor by the
state machine. A further alternative choice is a third processor.

f^con^^ 

^L T to= 

J! ™ZZ T ^ ,0 T ™ m0ty ThiS h3S Si9nmcant nee benemsUmedt aCi hms t pa EUl 

the memory is dynamic random access memory, and the buffer memory is adapted to load data from or store data to
the buffer memory in bursts.

[0012] Decoupling of the first processor from the buffer memory can be achieved by use of a second decoupling
element, wherein memory instructions relating to movement of data between the buffer memory and the memory are
passed to the buffer memory from the first processor through this second decoupling element, such that the buffer
memory consumes instructions derived from the first processor through the second decoupling element. The processing
of memory instructions by the buffer memory is thus decoupled from the operation of the first processor.
[0013] Where such a buffer memory is used, and as the first processor is decoupled from the other system elements,
it is desirable for there to be a synchronisation mechanism to synchronise transfer of data between the buffer memory
and the memory with execution of instructions by the second processor. Preferably, this is adapted to block execution
of instructions by the second processor on data which has not yet been loaded to the buffer memory from the memory,
and is adapted to block execution of memory instructions for storage of data from the buffer memory to the memory where
relevant instructions have not yet been executed by the second processor. Greatest efficiency is achieved when, if
execution of instructions or memory instructions is blocked by the synchronisation mechanism, other instructions or
memory instructions which are not blocked by the synchronisation mechanism may still be carried out.
[0014] In a further aspect, the invention provides a method of operating a computer system, comprising: providing
code for execution by a first processor; extracting from the code a task to be carried out by a second processor
acting as coprocessor to the first processor; passing information defining the task from the first processor to a
decoupling element; passing instructions derived from said information from the decoupling element to the second processor;
and executing said instructions on the second processor, wherein the processing of said instructions by the second
processor is decoupled from the operation of the first processor.

BRIEF DESCRIPTION OF FIGURES 
[0015] Specific embodiments of the invention will be described further below, by way of example, with reference to
the accompanying drawings, in which: 

Figure 1 shows the basic elements of a system in accordance with a first embodiment of the invention; 
Figure 2 shows the architecture of a burst buffers structure used in the system of Figure 1; 
Figure 3 shows further features of the burst buffers structure of Figure 2; 

Figure 4 shows the structure of a coprocessor controller used in the system of Figure 1 and its relationship to other 
system components; 

Figure 5 shows an example to illustrate a computational model usable on the system of Figure 1;
Figure 6 shows a timeline for computation and I/O operations for the example of Figure 5; 

Figure 7 shows an annotated graph provided as output from the frontend of a toolchain useful to provide code for 
the system of Figure 1; 

Figure 8 shows a coprocessor internal configuration derived from the specifications in Figure 7; 

Figure 9 shows the performance of alternative architectures for a 5x5 image convolution using 32 bit pixels; 

Figure 10 shows the performance of the alternative architectures used to produce Figure 9 for a 5x5 image
convolution using 8 bit pixels;

Figures 11A and 11B show alternative pipeline architectures employing further embodiments of the present
invention;

Figure 12 shows two auxiliary processors usable as an alternative to the coprocessor instruction queue and the 
burst instruction queue in the architecture of Figure 1; and 

Figure 13 shows implementation of a state machine as an alternative to the coprocessor instruction queue in the 
architecture of Figure 1.
DESCRIPTION OF SPECIFIC EMBODIMENTS



between the processor 1 and the co^rSSsor £or SETS* ' f f b "f ed so that a ^^icn can be partitioned 
tially any general purpose processor , ZZZ^^ computafonal efficiency. The processor 1 may be essen- 
of handling with significantly greater e^e^ 

essentially the whole computation fs to b L t£ 1 calcula fn. the specific system described here, 

the invention is not limited to ST^S£lSS!2 Pr ° C6SS ° r * by "» PrOC6SS ° r 1 " however - 

instead (with corresponding modlZi^^^^.^ia Bofh^ ° SPS ' ^ * 

2 have access to a DRAM main memorv th™,nh '' i PUla " onal m ° d f 1 re q"'red). Both the processor 1 and coprocessor 

4, typically SRAM. Efl£ZS£2 ™7he DRAM 3 f/n^HK "? ^ J* 0838 * 3 CaChe ° f faster access 
w«h DRAM for the efficient ^^J^^^J^^ 7™* 5 ada P' ed <° communicate 
Instructions to the burst buffers 5 Tare 2 ^ hroul « t ^ Wi " be describe d further below, 

under the control of a burst buffer control 7 rZTrZ^ ?TT° n qU9U6 6 ' and ,he burst buflers 5 °Perste 
be.ow, in the architecture a^SSV. c^oZlTinT^ " reaS ° nS diSCUSS6d 

processor instruction queue 8 and the cop oLsto™itif , T T th ° co P rocess ° r 2 *™ provided in a co^ 
nisation of the operation of the bursTbu^andlhe SST C ° ntr °' °' 9 co P rocessor stroller 9. Synchro- 

by a specific mechanism, ^^IZ^JZS^ I inS,rUCtbn qUeU9S is acnieved 

comprises the load/execute semapho e 1oS^^S.«S?2S ""J ^ emb ° diment ' ,he mecha nism 
described below (other such synchronisation 2^^^ i2Z2£^ » . 

Description of Elements in System Architecture

itself are carried out in the coprocessor S£ ^S^^SSTT tT^ ^ " StSpS h ,he com Puta.ion 
for particular tasks: configuration of the burst buf£ ; hr ° U P h the bUrS ' inStrUCtion c ' ueue 6 ' instructions 
5 and the main memory 3 Furthermore TLuTLT ' ^ ° f data be,Ween the burst buffer me ™ory 

instructions for furthe^^^ ^^ZT^^^ ~ I "° ^ 

sua- ™i r sxzzze^ 2 accesses da,a -^^^z^r^ on coproc - 

UrSesIor 2 and e 6 us C :^^^ the processor 7from the operation of 

5. The specific detail of thi ar angemenJ tSS^iSS^Tl^^ " 1mm th6 bUfSt bUfferS 

bejow in the context o, the o^ P U^~Z IS^^S^T^ " ^ 

tion No. GB98/00248, US Patent Applica bn Ito^^T?™^ GB98/00262 ' International Patent Applica- 

rather than instructions ieiating to detai I o h cateuCn To Vun fo T 7 ^ ° f ,h6 co P rocess - 2 - 

queue 8. The CHESS coprocessor 2 runs unriPMhJ ^ . , f B .u y °' eS } ,r ° m the co P r ocessor instruction 

data through inte rac tio^ Z^ZTsTte WESS^rl C ° M < 9 and ™*« and stores 

output stream. This can be an efficie procei iZ^o^^S^S* °" l ° P roduce a " 

T^e detailed operation of computation^ccording to ^3^^^^ ' S "** ^ M 

^ LmoryisTrov^Id aTrjRAM ^ Effective a a c^^rhT D R C A a M he 4 * * 3 C ° nVen,i ° nal ^ ^ 

been described in European Patent Applied ^fon No SS^L ^ h by b " rSt 5 BurSt buffers have 

09/3,526, filed on 6 January 1998 which iS, Jl! f 3 COrres P ondi n9 US Patent Application Serial No. 

by law. The burst buffTalZu^ J^" »«* P-m-— 

referred to these earlier applications t3 " S ° f thlS archltec t"re the reader is 

[0022] The burst buffer architecture is useful, but not fundamental, to the operation of the present invention as
described in these embodiments. In the context of the present invention, the most significant aspect of the burst buffers
architecture is that the burst buffers 5 operate according to instructions from the processor 1, and that these instructions
are provided by means of a queue (or alternative, as discussed below). This mechanism allows for the possibility of
decoupling of the processor 1 from operation of the burst buffers 5 in an appropriate architecture.

[0023] The elements of the version of the burst buffers architecture (variants are available, as is discussed in the
aforementioned application) used in this embodiment are shown in Figures 2 and 3. A connection 12 for allowing the
burst buffers components to communicate with the processor 1 is provided. Memory bus 16 provides a connection to
the main memory 3 (not shown in Figure 2). This memory bus may be shared with cache 4, in which case memory
datapath arbiter 58 is adapted to allow communication to and from cache 4 also.

[0024] The overall role of burst buffers in this arrangement is to allow computations to be performed on coprocessor
2 involving transfer of data between this coprocessor 2 and main memory 3 in a way that both maximises the efficiency
of each system component and at the same time maximises the overall system efficiency. This is achieved by a
combination of several techniques:

burst accesses to DRAM, using the burst buffers 5 as described below;

simultaneous execution of computation on coprocessor 2 and data transfers between main memory 3 and burst
buffer memory 5, using a technique called "double buffering"; and

decoupling the execution of processor 1 from the execution of coprocessor 2 and burst buffer memory 5 through
use of the instruction queues.

[0025] "Double buffering" is a technique known in, for example, computer graphics. In the form used here it involves
consuming (reading) data from one part of the burst buffer memory 5, while producing (writing) other data into a
different region of the same memory, with a switching mechanism to allow a region earlier written to now to be read
from, and vice-versa.

[0026] A particular benefit of burst buffers is effective utilisation of a feature of conventional DRAM construction. A
DRAM comprises an array of memory locations in a square matrix. To access an element in the array, a row must first
be selected (or 'opened'), followed by selection of the appropriate column. However, once a row has been selected,
successive accesses to columns in that row may be performed by just providing the column address. The concept of
opening a row and performing a sequence of accesses local to that row is called a "burst". When data is arranged in
a regular way, such as in media-intensive computations (typically involving an algorithm employing a regular program
loop which accesses long arrays without any data dependent addressing), then effective use of bursts can dramatically
increase computational speed. Burst buffers are new memory structures adapted to access data from DRAM through
efficient use of bursts.

[0027] A system may contain several burst buffers. Typically, each burst buffer is allocated to a respective data
stream. Since algorithms have a varying number of data streams, a fixed amount of SRAM 26 is available to the burst
buffers as a burst buffer memory area, and this amount is divided up according to the number of buffers required. For
example, if the amount of fixed SRAM is 2 Kbytes, and if an algorithm has four data streams, the memory region might
be partitioned into four 512 Byte burst buffers.
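
The partitioning in the example above can be sketched in a few lines of Python. This is only an illustration: the function name `partition_burst_buffers` is hypothetical, and the equal split is an assumption (the text says only that the region "might be" partitioned this way).

```python
def partition_burst_buffers(sram_bytes, num_streams):
    """Split a fixed SRAM area equally; return one (offset, size) pair per buffer.

    Equal-sized buffers are an assumption made for this sketch.
    """
    if num_streams <= 0:
        raise ValueError("at least one data stream is required")
    size = sram_bytes // num_streams
    return [(i * size, size) for i in range(num_streams)]


# 2 Kbytes of SRAM shared by an algorithm with four data streams.
print(partition_burst_buffers(2048, 4))
```

Each pair gives the offset of a buffer within the burst buffer memory area and its size in bytes.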

[0028] In architectures of this type, a burst comprises the set of addresses defined by:

$$\text{burst} = \{\, B + S \times i \mid B, S, i \in \mathbb{N} \wedge 0 \le i < L \,\}$$

where B is the base address of the transfer, S is the stride between elements, L is the length and $\mathbb{N}$ is the set of natural
numbers. Although not explicitly defined in this equation, the burst order is defined by i incrementing from 0 to L-1.
Thus, a burst may be defined by the 3-tuple of:
(base_address, length, stride)
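
As an illustration of this definition, the address sequence of a burst can be enumerated directly from its 3-tuple. The Python sketch below is not taken from the specification; the function name `burst_addresses` is an assumption made for the example.

```python
def burst_addresses(base_address, length, stride):
    """Return the addresses B + S*i for i = 0 .. L-1, in burst order."""
    if base_address < 0 or length < 0 or stride < 0:
        raise ValueError("B, S and L must be natural numbers")
    return [base_address + stride * i for i in range(length)]


# Four elements starting at address 100, spaced 8 address units apart.
print(burst_addresses(100, 4, 8))  # [100, 108, 116, 124]
```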

[0029] In software, a burst may also be defined by the element size. This implies that a burst may be sized in bytes,
halfwords or words. The units of stride must take this into account. A "sized-burst" is defined by a 4-tuple of the form:
(base_address, length, stride, size)

[0030] A "channel-burst" is a sized-burst where the size is the width of the channel to memory. The compiler is
responsible for the mapping of software sized-bursts into channel-bursts. The channel-burst may be defined by the
4-tuple:

(base_address, length, stride, width)

[0031] If the channel width is 32 bits (or 4 bytes), the channel-burst is always of the form:
(base_address, length, stride, 4)
or abbreviated to the 3-tuple (base_address, length, stride)

■ mmasm 

™S?|££S3 bln9 1 69 ' 0 " 5 °' ma " bu,s,in 9 '» a "° "° m »• burst buff.r memo* .TXfS Ac?«s 

[0033] A burst buffers arrangement which did not emolov MATs and ratc ^oi.^k ^ i s , _ 

EL BUrSt ' nS, : U f ,i0ns ori 9 ina «ng from the processor l are provided to the burst buffers 5 by means of a burst 

« ^^^i'^^ir^; burst ins,ruc,ion queue 6 are processed b * a buff - «^ 54 To 

edsters 52 In o mJT t h 6 B f 66 bU " er COnlr0 " er a ' SO receives con,rol in P uls •"«*« burst oonuS 
registers 52. Information contained in these two tables is bound together at run time to describe a cbmnirt™iT 

buTblfs meX a- 26. ^ ^ 58 * ^ ,ranSaCti ° nS ^ *» ™ in 3 and *• 

[0035] The "burst instructions" are those used to transfer data from main memory 3 to the burst buffer memory area
26 and from the burst buffer memory area 26 to main memory 3. These are "loadburst" and "storeburst".
The loadburst instruction causes a burst of data words to be transferred from a determined location
in the memory 3 to that one of the burst buffers. There is also a corresponding storeburst instruction which causes a
burst of data words to be transferred from that one of the burst buffers to the memory 3, beginning at a specified address.

S£ZS3£2E ™ re of Fi9ure 1 • addi,ional sy " n in ~ « a - iSSSSS 

[0036] The instructions loadburst and storeburst differ from normal load and store instructions in that they complete
in a single cycle, even though the transfer has not occurred. In essence, the loadburst and storeburst
instructions tell the memory interface 16 to perform the burst, but they do not wait for the burst to complete.

[0037] The fundamental operation is to issue an instruction which indexes to two table entries, one in each of the
memory access and buffer access tables. The index to the memory access table retrieves the base address, extent
and stride used at the memory end of the transfer. The index to the buffer access table retrieves the base address
within the burst buffer memory region. In the embodiment shown, masking and offsets are provided to the index values
(this is discussed further in European Patent Application No. 97309514.4), although it is possible to
use actual addresses instead. The direct memory access (DMA) controller 56 is passed the parameters from the two
tables and uses them to specify the required transfer.

[0038] Table 1 shows a possible instruction set. 
| Opcode | Parameter Value | Comment |
|---|---|---|
| BB_LOADBURST | mat_index (integer), bat_index (integer), block_increment (boolean) | Loads a burst of data into the burst buffer memory from main memory, and optionally increments the base address in main memory |
| BB_STOREBURST | mat_index (integer), bat_index (integer), block_increment (boolean) | Stores a burst of data into main memory from the burst buffer memory, and optionally increments the base address in main memory |
| BB_LX_INCREMENT | N/A | Increments the value of the LX semaphore |
| BB_XS_DECREMENT | N/A | Decrements the value of the XS semaphore |
| BB_SET_MAT | entry (integer), memaddr (integer), extent (integer), stride (integer) | Sets a MAT entry to the desired values |
| BB_SET_BAT | entry (integer), bufaddr (integer), extent (integer) | Sets a BAT entry to the desired values |

Table 1: Instruction set for burst buffers

[0039] The storeburst instruction (BB_STOREBURST) indexes parameters in the MAT and BAT, which define the
characteristics of the requested transfer. If the block_increment bit is set, the memaddr field of the indexed entry in the
MAT is automatically updated when the transfer completes (as is discussed below).

[0040] The loadburst instruction (BB_LOADBURST) also indexes parameters in the MAT and BAT, which again define
the characteristics of the required transfer. As before, if the block_increment bit is set, the memaddr field of the indexed
entry in the MAT is automatically updated when the transfer completes.
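
The table indexing and block increment behaviour described for loadburst can be sketched in Python as follows. This is a simplified model under stated assumptions, not the hardware design: the names `MatEntry`, `BatEntry` and `loadburst` are hypothetical, extent and stride are taken to be in bytes with 4-byte words, the value written to bufsize is assumed to be the extent, and the amount by which memaddr advances when block_increment is set is an assumption (the text says only that memaddr is updated).

```python
from dataclasses import dataclass

WORD = 4  # assumed element size in bytes


@dataclass
class MatEntry:
    memaddr: int  # base address in main memory
    extent: int   # bytes to transfer (assumed unit)
    stride: int   # bytes between successive elements (assumed unit)


@dataclass
class BatEntry:
    bufaddr: int      # offset of the buffer within the burst buffer area
    bufsize: int = 0  # recorded when a transfer completes


def loadburst(main, buf, mat, bat, mat_index, bat_index, block_increment):
    """Gather strided words from main memory into a contiguous buffer region."""
    m, b = mat[mat_index], bat[bat_index]
    words = m.extent // WORD
    for i in range(words):
        src = m.memaddr + i * m.stride
        dst = b.bufaddr + i * WORD
        buf[dst:dst + WORD] = main[src:src + WORD]
    b.bufsize = m.extent  # assumed meaning of bufsize
    if block_increment:
        m.memaddr += words * m.stride  # assumed step: start of the next block


main = bytearray(range(64))
buf = bytearray(16)
mat = [MatEntry(memaddr=0, extent=8, stride=8)]
bat = [BatEntry(bufaddr=4)]
loadburst(main, buf, mat, bat, 0, 0, block_increment=True)
print(list(buf[4:12]), mat[0].memaddr)
```

A storeburst would follow the same pattern with source and destination exchanged.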

[0041] The synchronisation instructions needed are provided as Load-Execute Increment and eXecute-Store
Decrement (BB_LX_INCREMENT and BB_XS_DECREMENT). The purpose of BB_LX_INCREMENT is to make sure that
the execution of coprocessor 2 on a particular burst of data happens after the data needed has arrived into the burst
buffer memory 5 following a loadburst instruction. The purpose of BB_XS_DECREMENT is to make sure that the
execution of a storeburst instruction follows the completion of the calculation (on the coprocessor 2) of the results that
are to be stored back into main memory 3.

[0042] In this embodiment, the specific mechanism upon which these instructions act is a set of two counters that 
track, respectively: 

the number of regions in burst buffer memory 5 ready to receive a storeburst; and 
the number of completed loadburst instructions. 
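
The ordering that these two counters enforce can be illustrated with a small Python simulation. It is a sketch under simplifying assumptions: transfers and computation complete instantly, each queue issues at most one instruction per round, and a decrement simply waits at the head of its queue while its counter is zero. The function name `run_queues` is hypothetical; the opcode names are those of the burst buffer and coprocessor controller instruction sets.

```python
from collections import deque


def run_queues(burst_ops, coproc_ops):
    """Interleave two instruction queues; a decrement stalls while its counter is 0."""
    counters = {"LX": 0, "XS": 0}
    # opcode -> (counter name, +1 or -1); all other opcodes never stall.
    sync = {
        "BB_LX_INCREMENT": ("LX", +1),
        "CC_LX_DECREMENT": ("LX", -1),
        "CC_XS_INCREMENT": ("XS", +1),
        "BB_XS_DECREMENT": ("XS", -1),
    }
    queues = [deque(burst_ops), deque(coproc_ops)]
    trace = []
    while any(queues):
        progressed = False
        for queue in queues:
            if not queue:
                continue
            op = queue[0]
            if op in sync:
                name, delta = sync[op]
                if delta < 0 and counters[name] == 0:
                    continue  # blocked: leave the instruction at the head
                counters[name] += delta
            trace.append(queue.popleft())
            progressed = True
        if not progressed:
            raise RuntimeError("both queues are blocked")
    return trace


trace = run_queues(
    ["BB_LOADBURST", "BB_LX_INCREMENT", "BB_XS_DECREMENT", "BB_STOREBURST"],
    ["CC_LX_DECREMENT", "CC_START_EXEC", "CC_XS_INCREMENT"],
)
print(trace)
```

In the resulting trace the execution step appears only after the load has been signalled, and the store appears only after the execution has been signalled, even though the two queues were filled independently.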

[0043] Requests for data by the coprocessor 2 are performed by decrementing the LX counter, whereas the availability
of data is signalled by incrementing the XS counter. These counters have to satisfy two properties: they must be
C ° mPC "* n ' " 9iVOT «" .o euspand „e process

New York: Academic Press. (1968) panes 43-112 T^!™ ..""Tr" (E0 "°"' p '°9rammlng Languages, 

amploysd in embodiments ol Ihs SemS M I en ud IT^Z h 13 <° "asc*a Ih. counrars 

ssmaphorss described by D.J «„ ™ ££U fnC- C< " ,n "" S " M id " tol "> lhe 

it Exacting a Wail () on a semaphore »„o2 vaTn. S 35S 0 sto.T.^f 9 " ' 9na ' 0 ' nS " UCtton 

M> » rryino ,o axacuta ,h. «J „ «, tne ^ZT^,^,ltS17 ^ °' C °^°"™> 

CrT~i~ 

different implementations are of course possib.e. ^^ZSS^St^^S- ^ *° U * 

3. Stride (stride) - the interval between successive elements in a transfer.

memaddr - This is the 32 bit unsigned, word-aligned address of the first element of the channel-burst.

[0048] An example of values contained by a MAT slot might be 
{0x1fee1bad, 128, 16}

[0049] which results in a 32 word (32 4-byte words) burst, with each word separated by 4 words (4 4-byte words).
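
The arithmetic behind this example can be checked with a short Python sketch. It assumes that the extent and stride fields of the entry are expressed in bytes and that a word is 4 bytes, which is the reading under which 128 and 16 give 32 words separated by 4 words; the function name `describe_mat_entry` is hypothetical.

```python
WORD_BYTES = 4  # assumed channel width: one 32 bit word


def describe_mat_entry(extent, stride):
    """Convert byte-valued extent and stride into (words in burst, stride in words)."""
    return extent // WORD_BYTES, stride // WORD_BYTES


# extent = 128 bytes, stride = 16 bytes, as in the example MAT slot.
print(describe_mat_entry(128, 16))  # (32, 4)
```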

[0050] The buffer access table (BAT) 66 will now be described with reference to Figure 3. This is a
descriptor table, in this case holding information relating to the burst buffer memory area 26. Each entry in the BAT 66
describes a transaction to the burst buffer memory area 26. As with the MAT, the BAT 66

though can o, coure. be varied as * ,b. JfMi^ CT£ ^^.^Sr™ 1 *- ' 6 







EP 1 061 438 A1 



1. Buffer address (bufaddr) - the start of the buffer in the buffer area

2. Buffer size (bufsize) - the size of the buffer area used at the last transfer 

[0051] The buffer address parameter bufaddr is the offset address for the first element of the channel-burst in the
buffer area. The burst buffer area is physically mapped by hardware into a region of the processor's memory space.
This means that the processor must use absolute addresses when accessing the burst buffer area. However, DMA
transfers simply use the offset, so it is necessary for hardware to manage any address resolution required. Illegally
aligned values may be automatically aligned by truncation. Reads of this register return the value used for the burst
(i.e. if truncation was necessary, then the truncated value is returned). The default value is 0.
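
Alignment by truncation can be shown with a one-line calculation. In the Python sketch below the 4-byte alignment is an assumption (one 32 bit word), and the function name `align_by_truncation` is hypothetical.

```python
def align_by_truncation(addr, alignment=4):
    """Round an address down to the nearest multiple of the alignment."""
    return addr - (addr % alignment)


# A misaligned offset is truncated to the start of the word that contains it.
print(hex(align_by_truncation(0x1003)))  # 0x1000
```

A read of the register after such a write would then return the truncated value rather than the value originally written.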

[0052] The parameter bufsize is the size of the region within the buffer area occupied by the most recent burst. This 
register is automatically set on the completion of a burst transfer which targeted its entry. Note that the value stored is 
the burst length, since a value of 0 indicates an unused buffer entry. This register may be written, but this is only useful 
after a context switch when buffers are saved and restored. The default value is again 0. 

[0053] Programming MAT and BAT entries is performed through the use of BB_SET_MAT and BB_SET_BAT
instructions. The entry parameter determines the entry in the MAT (or BAT) to which the current instruction refers.
[0054] Further details of the burst buffer architecture and the mechanisms for its control are provided in European
Patent Application No. 97309514.4 and the corresponding US Patent Application Serial No. 09/3,526. The details
provided above are primarily intended to show the architectural elements of the burst buffer system, and to show the
functional effect that the burst buffer system can accomplish, together with the inputs and outputs that it provides. The
burst buffer system is optimally adapted for a particular type of computational model, which is developed here into a
computational model for the described embodiment of the present invention. This computational model is described
further below.

[0055] The burst instruction queue 6 has been described above. A significant aspect of the embodiment is that
instructions are similarly provided to the coprocessor through a coprocessor instruction queue 8. The coprocessor
instruction queue 8 operates in connection with the coprocessor controller 9, which determines how the coprocessor
receives instructions from the processor 1 and how it exchanges data with the burst buffer system 5.
[0056] Use of the coprocessor instruction queue 8 has the important effect that the processor 1 is decoupled
from the calculation itself. During the calculation, processor resources are thus available for the execution of other
tasks. The only situation which could lead to operation of processor 1 being stalled is that one of the instruction queues
6, 8 is full of instructions. This case can arise when processor 1 produces instructions for either queue at a rate faster
than that at which instructions are consumed. Solutions to this problem are available. Effectiveness can be improved
by requiring the processor 1 to perform a context switch and return to service these two queues after a predefined
amount of time, or upon receipt of an interrupt triggered by the fact that the number of slots occupied in either queue
has decreased to a predefined amount. Conversely, if one of the two queues becomes empty because the processor
1 cannot keep up with the rate at which instructions are consumed, the consumer of those instructions (the coprocessor
controller 9 or the burst buffer controller 7) will stall until new instructions are produced by the processor 1.
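
The two stall conditions described above (producer stalls on a full queue, consumer stalls on an empty one) can be modelled with a bounded queue. The Python sketch below is illustrative only; the class name `InstructionQueue`, its methods and the capacity of two slots are assumptions, not details from the specification.

```python
from collections import deque


class InstructionQueue:
    """Bounded FIFO between a producer (processor) and a consumer (controller)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.slots = deque()

    def try_put(self, instruction):
        """Producer side: False means the queue is full and the producer must wait."""
        if len(self.slots) >= self.capacity:
            return False
        self.slots.append(instruction)
        return True

    def try_get(self):
        """Consumer side: None means the queue is empty and the consumer stalls."""
        return self.slots.popleft() if self.slots else None


queue = InstructionQueue(capacity=2)
accepted = [queue.try_put(op) for op in ("CC_CURRENT_PORT", "CC_PORT_PERIOD", "CC_START_EXEC")]
print(accepted)         # the third instruction does not fit
print(queue.try_get())  # the consumer frees a slot
print(queue.try_put("CC_START_EXEC"))
```

When `try_put` fails, the producer is free to switch to another task and retry later, which corresponds to the context switch or interrupt-driven servicing mentioned in the text.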
[0057] Modifications can also be provided to the architecture which ensure that no further involvement from the 
processor 1 is required at all, and these will be discussed in the final part of this specification. 

[0058] The basic functions of the coprocessor controller 9 are to fetch data from the burst buffer memory 5 to the 
coprocessor 2 (and vice versa), to control the activity of the coprocessor, and to synchronise the execution of the 
coprocessor 2 with the appropriate loads from, or stores to, the burst buffer memory 5. To achieve these functions, the 
coprocessor controller may be in essence a relatively simple state machine able to generate addresses according to 
certain rules. 

[0059] Figure 4 shows the coprocessor controller 9 in its relationship to the other components of the architecture, 
and also shows its constituent elements and its connections with other elements in the overall architecture. Its exact 
function depends on the type of inputs and outputs required by the coprocessor 2 and its initialisation requirements (if 
any), and so may vary in detail from that described below. In the case of a CHESS coprocessor, these inputs and 
outputs are input and output data streams exchanged with the burst buffer memory 5. 
[0060] Coprocessor controller 9 performs two main tasks:

control of the communication between the coprocessor 2 and the burst buffer memory 5; and 
maintenance of a system state through the use of a control finite state machine 42. 






[0061] The coprocessor 2 accesses data in streams, each of which is given an association with one of a number of
control registers 41. Addresses for these registers 41 are generated in a periodic fashion by control finite state machine
42 with addressing logic 43, according to a sequence generated by the finite state machine 42.
[0062] At every tick of a clock within the finite state machine 42, the finite state machine gives permission for (at 
S« th h ',T„ 10 haVe 3 nSW addrGSS 9 enerated for it and the address used to allow the register 41 to

macZ 5? h 5 At ^ time ' 30 a PP r °P^ ^trol signal is generated by "he f n .e sta e 

She ™ ? S6n ' t0 9 mU,,ip,eXer 44 S ° that ,he a PPr°P*ate address is sent to the burst buffer memory 5 tooethe^ 

sr sr^r ^rj^ssr si9nal is ~ d wi,h each re9is,er 41 • 

foTs value" o'ene^hesam 131 : 6 !'" * hM bMn ^ * addrSSS memor » a constan < Quantity is added 

5 fhit s iHhe wfd J onh 3 ? , 6 ' ? COnnection betwee " th e coprocessor 2 and the burst buffer memory 
5. That is, if the width of this connection is 4 bytes, then the increment made to counter 41 will be 4. This is essentially
comparable to "stride" in the programming of burst buffers.
[0064] The coprocessor controller mechanism described above allows the multiplexing of different data streams 

£m S Fc lT, T\ 0i d8la StreamS Ca " be C ° nSid6red t0 aCCess the Single ehare'd bus Jhrc ughToS^T 
k k y ° Pera,e SUCh that ,hS intS9ri,y ° f communication is ensured, it is necessary that at 7s other 

sibfmJ niTt r °?T ° f 2 " feady 10 ^ ' r ° m and Write to this bus in a synchronous manner t is the Ispon 
V*eZT*St ^ (and SPeCiMCa,ly ' l ° ,hS Part °* ,hS app,ica,i0n S0f,ware that c -«9"res coprocessor 

no two streams try and access the bus at the same time; and that

the execution of coprocessor 2 is synchronous with the data transfer to and from burst buffer memory 5.

[0066] This latter requirement ensures that the coprocessor 2 is ready to read the data placed by the burst buffers
memory 5 on the connection between the two devices, and vice-versa.

hTJ m A " hOU fr e than ° nS PhySICa ' Hne COUld use,u "y be P rovided be,wee n 'he Chess array 2 and the burst 
betwLTth ' 96n o ral '° r mul "P lexin 9 wou,d t'» ^main. Unless the number of ph^S^SZ 
streTmrtor ,h° PrOCeSSOr T ^ bU ™ mdm °' y 5 " 9 r ^ er than or equal to the total number «?£j2tS 

samTwirl t!h ^ C ° prOCGS 1 SOr 2 ' " Wi " a ' WayS b ° trU ° that ,W ° ° r more '°9 ical streams have to be multiplexed on he 
memory 5) discourage the use of more than one connection with the coprocessor 2 

[0068] The coprocessor controller 9 also acts to control the execution of the CHESS array comprising coprocessor
2 so that it will run for a specified number of clock cycles. This is achieved by the counter in the control finite state
machine 42 ticking for the specified number of cycles before "freezing" the CHESS array by "gating" (that is, stopping)
its internal clock, in a way that does not affect the internal state of the pipelines in the coprocessor 2. This number of
ticks is specified using the CC_START_EXEC instruction, described below.

[0069] Coprocessor controller 9 is programmed by processor 1 through the use of the coprocessor instruction queue
8. A possible instruction set for this coprocessor controller 9 is shown in Table 2 below.
| Op | Parameter Value | Comment |
|---|---|---|
| CC_CURRENT_PORT | n (integer) | Port # the next CC_PORT_xxx commands will refer to |
| CC_PORT_PERIOD | (integer) | Period of activity of a port |
| CC_PORT_PHASE_START | start (integer) | Phase start of the activity of a port |
| CC_PORT_PHASE_END | end (integer) | Phase end of the activity of a port |
| CC_PORT_TIME_START | t_start (integer) | Start cycle of the activity of a port |
| CC_PORT_TIME_END | (integer) | End cycle of the activity of a port |
| CC_PORT_ADDRESS | addr_start (integer) | Initial address for a port |
| CC_PORT_INCREMENT | addr_incr (integer) | Address increment for a port |
| CC_PORT_IS_WRITE | rw (boolean) | Read/Write flag |
| CC_START_EXEC | n_cycles (integer) | Start/Resume the execution of coprocessor 2 for a determined # of cycles |
| CC_LX_DECREMENT | N/A | Decrement the value of the LX semaphore |
| CC_XS_INCREMENT | N/A | Increment the value of the XS semaphore |

Table 2: Coprocessor controller instruction set
[0070] For the aforementioned instructions, different choices of instruction format could be made. One possible
format is a 32-bit number, in which 16 bits encode the opcode, and 16 bits encode the optional parameter value
described above.
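The 32-bit format suggested above can be illustrated with a minimal sketch. The opcode numbers, the helper names and the choice of placing the opcode in the upper 16 bits are all assumptions made for illustration; the description fixes only the 16/16 split.

```c
#include <stdint.h>

/* Hypothetical opcode numbers: the description names the instructions
   but does not assign them numeric values. */
enum cc_opcode {
    CC_CURRENT_PORT = 1,
    CC_PORT_PERIOD  = 2,
    CC_START_EXEC   = 10
};

/* Pack one instruction word. Assumption: the opcode occupies the upper
   16 bits and the optional parameter the lower 16 bits. */
static uint32_t cc_encode(uint16_t opcode, uint16_t param)
{
    return ((uint32_t)opcode << 16) | (uint32_t)param;
}

/* Recover the opcode field from an instruction word. */
static uint16_t cc_opcode_of(uint32_t word)
{
    return (uint16_t)(word >> 16);
}

/* Recover the parameter field from an instruction word. */
static uint16_t cc_param_of(uint32_t word)
{
    return (uint16_t)(word & 0xFFFFu);
}
```

With this layout a parameter is limited to 16 bits, which is sufficient for the cycle counts used in the vector addition example later in this description.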

[0071] The semantics of individual instructions are as follows: 
CC_CURRENT_PORT selects one of the ports as the recipient of all the following CC_PORT_xxx instructions,
until the next CC_CURRENT_PORT;

CC_PORT_PERIOD sets the period of activation of the current port to the value of the integer parameter;

CC_PORT_PHASE_START/CC_PORT_PHASE_END (start, end) set the start/end of the activation phase of
the current port to the value of the integer parameter (start, end);

CC_PORT_TIME_START/CC_PORT_TIME_END (t_start, t_end) set the first/last cycle of activity of the current port;

CC_PORT_ADDRESS (addr_start) sets the current address of the current port to the value of the integer parameter
addr_start;

CC_PORT_INCREMENT (addr_incr) sets the address increment of the current port to the value of the integer
parameter addr_incr;

CC_PORT_IS_WRITE (rw) sets the data transfer direction for the current port to the value of the Boolean parameter
rw;

CC_START_EXEC (n_cycles) initiates the execution of coprocessor 2 for a number of clock cycles specified
by the associated integer parameter n_cycles;

CC_LXS_DECREMENT decrements (in a suspensive manner, as previously described) the value of the LX
semaphore;

CC_XSS_INCREMENT increments the value of the XS semaphore.

[0072] A port is defined as active (that is, it has control of the communication with the burst buffer memory 5) if the
current value of counter 42 falls within the port's time window (t_start to t_end) and, taken modulo the port's period,
falls within its activation phase (start to end). This allows, for example, the possibility that two streams exist with
equal period, say 5, and one has control of the BB memory for the first 4 cycles, and the other has control for the
remaining cycle.
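The activity condition for a port can be sketched as follows. This is an illustrative model only: the structure and function names are hypothetical, and the sketch assumes that the time window is inclusive at both ends, that the phase interval is half-open, and that the phase is measured from the start of the time window.

```c
#include <stdbool.h>

/* Hypothetical per-port settings, mirroring the CC_PORT_xxx parameters. */
struct port_cfg {
    unsigned period;       /* CC_PORT_PERIOD */
    unsigned phase_start;  /* CC_PORT_PHASE_START */
    unsigned phase_end;    /* CC_PORT_PHASE_END */
    unsigned time_start;   /* CC_PORT_TIME_START */
    unsigned time_end;     /* CC_PORT_TIME_END */
};

/* Returns true if the port controls the burst buffer memory at this tick.
   Assumptions: inclusive time window, half-open phase interval. */
static bool port_is_active(const struct port_cfg *p, unsigned tick)
{
    if (p->period == 0 || tick < p->time_start || tick > p->time_end)
        return false;
    unsigned phase = (tick - p->time_start) % p->period;
    return phase >= p->phase_start && phase < p->phase_end;
}
```

Under these assumptions, two ports of period 5 with phases 0 to 4 and 4 to 5 never claim the memory on the same tick, which is the example of two streams sharing the BB memory given above.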

[0073] Execution of an algorithm using this architecture involves first the programming of the coprocessor 2.

[0074] For the initialisation of the coprocessor 2, it will generally be most straightforward for the configuration to be
loaded into the coprocessor itself by means specific to the actual embodiment of the device.

[0075] For the programming of the coprocessor controller 9, the steps are as follows:

First, the coprocessor controller 9 is configured according to the total number, periods, phases and address
increments for every logical stream present in the CHESS array, as described before. An example of the
programming of the coprocessor controller 9 to perform the desired functions is provided below.

The next step in the configuration of coprocessor controller 9 is address configuration. Although it is likely that
the characteristics (period, phase) of every logical stream will remain the same throughout an algorithm, the actual
addresses accessed by the coprocessor controller 9 in the burst buffers memory 5 will vary. It is this variability
which allows the burst buffers controller 7 to perform double-buffering in a straightforward manner within the burst
buffers architecture. The effect of this double-buffering, as previously stated, is to give the coprocessor 2 the
impression that it is interacting with continuous streams, whereas in fact buffers are being switched continuously.

[0076] The burst buffers controller 7 also needs to be configured. To do this, the appropriate commands have to be
sent to the burst instruction queue 6 in order to configure the transfers of data to and from main memory 3 into the
burst buffers memory 5. These instructions (BB_SET_MAT and BB_SET_BAT) configure the appropriate entries within
the BAT and the MAT, in a manner consistent with the programming of the coprocessor controller 9. In this embodiment,
these instructions are issued through the burst instruction queue 6. An alternative
possibility would be the use of memory-mapped registers which the processor 1 would write to and read from. As in
the present embodiment there is no possibility of reading from memory-mapped registers (as they are not present),
the processor 1 cannot query the state of burst buffers controller 7; however, this is not a significant limitation.
Furthermore, the use of the burst instruction queue 6 for this purpose allows the possibility of interleaving instructions to
configure MAT and BAT entries with the execution of burst transfers, thus maintaining correct temporal semantics
without the supervision of the processor 1.

[0077] Once all these steps have been performed, the actual execution of the CHESS array can be started. It is necessary
in this embodiment only to instruct the CHESS array to run for a specified number of cycles. This is achieved by writing
the number of cycles as a parameter to a CC_START_EXEC instruction in the coprocessor instruction queue 8
so that this data can then be passed to the coprocessor controller 9. One clock cycle after this value has been transferred
into coprocessor controller 9, the controller starts transferring values between the burst buffer memory 5 and the CHESS
array of coprocessor 2, and enables the execution of the CHESS array.

[0078] An important step must however be added before instructions relating to the computation are placed in the
respective instruction queues. This is to ensure the necessary synchronisation mechanisms are in place to implement
successfully the synchronisation and double-buffering principles. The basic element in this mechanism is that the
coprocessor controller 9 will try to decrement the value of the LX semaphore and will suspend coprocessor operation
until it can do so, according to the logic described above. The initial value of this semaphore is 0: the coprocessor
controller 9 and the coprocessor 2 are hence "frozen" at this stage. Only when the value of the LX semaphore is
incremented by the burst buffers controller 7 after a successful loadburst instruction will the coprocessor 2 be able to
start (or resume) its execution. To achieve this effect, a CC_LX_DECREMENT instruction is inserted in the coprocessor
instruction queue 8 before the "start coprocessor 2 execution" (CC_START_EXEC) instruction. As will be shown, a
corresponding "increment the LX semaphore" (BB_LX_INCREMENT) instruction will be inserted in the burst instruction
queue 6 after the corresponding loadburst instruction.
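The suspensive decrement can be modelled in software as a counter that refuses to go below zero. The following is a minimal sketch under that assumption; the type and function names are hypothetical, and in the architecture itself the stall is a hardware mechanism rather than a return value.

```c
#include <stdbool.h>

/* Hypothetical software model of the LX semaphore: a simple counter. */
struct lx_sem {
    unsigned value;
};

/* Models BB_LX_INCREMENT, issued after a completed loadburst. */
static void sem_increment(struct lx_sem *s)
{
    s->value++;
}

/* Models CC_LX_DECREMENT. Returns true if the decrement took place;
   false means the requester must stall and retry later. */
static bool sem_try_decrement(struct lx_sem *s)
{
    if (s->value == 0)
        return false;
    s->value--;
    return true;
}
```

Starting from a value of 0, the first decrement attempt fails (the coprocessor stays "frozen") and succeeds only after an increment has been recorded.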

[0079] The actual transfer of data between CHESS logical streams and the burst buffer memory 5 is carried out in
accordance with the programming of the coprocessor controller 9 as previously described.

[0080] The number of ticks for which the counter 42 has to run depends on how long it takes to consume one or
more input bursts. It is left to the application software to ensure the correctness of the system. The programming of
the counter 42 must be such that, once a buffer has been consumed, the execution of coprocessor 2 will stop. The
next instruction in the coprocessor instruction queue 8 must be a synchronisation instruction (that is, a
CC_LX_DECREMENT), in order to ensure that the next burst of data has arrived into the burst buffers memory 5.
Following this instruction (and, possibly, a waiting period until the data required is available), the initial address of this
new burst of data is assigned to the data stream (with a CC_PORT_ADDRESS instruction), and execution is resumed
via a CC_START_EXEC instruction. The procedure is similar for output streams (with the important difference that
there will be no waiting period equivalent to that required for data to arrive from main memory 3 into burst buffers
memory 5).

Computational Model 

[0081] An illustration of the overall computation model will now be described, with reference to Figure 5. The
illustration indicates how an algorithm can be recoded for use in this architecture, using as an example a simple vector
addition, which can be coded in C for a conventional microprocessor as:

    int a[1024], b[1024], c[1024];
    int i;
    for (i=0; i<1024; i++)
        a[i]=b[i]+c[i];

[0082] A piece of C code to run on processor 1 which achieves on the architecture of Figure 1 the same functionality
as the original vector addition loop nest is as follows:
    int a[1024], b[1024], c[1024];
    int eo, not_eo, k;
    /*Port 0 specification*/
    CIQ_STREAM( 0, 4, 4, 3, 0, 1, 0, 3*BLEN*MAXK+3, 0 );
    /*Port 1 specification*/
    CIQ_STREAM( 1, 4, 4, 3, 1, 2, 0, 3*BLEN*MAXK+3, 0 );
    /*Port 2 specification*/
    CIQ_STREAM( 2, 4, 4, 3, 2, 3, 0, 3*BLEN*MAXK+3, 1 );
    BIQ_SET_MAT(0, &b[0], BLEN*4, 4);
    BIQ_SET_MAT(1, &c[0], BLEN*4, 4);
    BIQ_SET_MAT(2, &a[0], BLEN*4, 4);
    BIQ_SET_BAT(0, 0x0000, BLEN*4);  BIQ_SET_BAT(1, 0x0100, BLEN*4);
    BIQ_SET_BAT(2, 0x0200, BLEN*4);  BIQ_SET_BAT(3, 0x0300, BLEN*4);
    BIQ_SET_BAT(4, 0x0400, BLEN*4);  BIQ_SET_BAT(5, 0x0500, BLEN*4);
    for( k = 0; k < MAXK; k++ )
    {
        eo = k & 0x1; /*even or odd iteration? - double buffering*/
        CIQ_LXD(2);
        CIQ_SA(0, (BLEN*4*eo));
        CIQ_SA(1, ((2*BLEN*4)+BLEN*4*eo));
        CIQ_SA(2, ((4*BLEN*4)+BLEN*4*eo));
        /*Start Chess*/
        CIQ_ST(3*BLEN);
        CIQ_XSI(1);
        /*BB Stuff*/
        /*Load A*/
        BIQ_FLB(0,eo);
        /*Load B*/
        BIQ_FLB(2,2+eo);
        BIQ_LXI(2);
        if ( k >= 1 )
        {
            not_eo = (eo==0)?1:0;
            BIQ_XSD(1);
            BIQ_FSB(4,4+not_eo);
        }
    }
    eo = MAXK & 0x1;
    not_eo = (eo==0)?1:0;
    BIQ_XSD(1);
    BIQ_FSB(4,4+not_eo);

[0083] The CIQ_STREAM lines are code macros used to initialise the ports. These, when expanded, result in the
following commands (with reference to line 4 - the other expanded macros are directly analogous):

    CC_CURRENT_PORT(0);
    CC_PORT_INCREMENT(4);
    CC_TRANSFER_SIZE(4);
    CC_PORT_PERIOD(3);
    CC_PORT_PHASE_START(0);
    CC_PORT_PHASE_END(1);
    CC_PORT_START_TIME(0);
    CC_PORT_END_TIME(3*BLEN*MAXK+3);
    CC_PORT_IS_WRITE(0);

[0084] This code has the effect that port 0 will read 4 bytes of data every 3rd tick of counter 42, and precisely at ticks
0, 3, 6 ... 3*BLEN*MAXK+3, and will increment the address it reads from by 4 bytes each time. BLEN*MAXK is the
length of the two vectors to sum (in this case, 1024), and BLEN is the length of a single burst of data from DRAM (say,
64 bytes). With these values, MAXK will be set to 1024/64=16.
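The arithmetic of this port 0 schedule can be checked with a short sketch. The function names are hypothetical, and the sketch assumes that the end time is inclusive and that the phase interval 0 to 1 selects exactly the ticks that are multiples of the period.

```c
/* Values taken from the example: 1024-element vectors, bursts of 64. */
enum { VECTOR_LEN = 1024, BLEN = 64, MAXK = VECTOR_LEN / BLEN };

/* Returns 1 if port 0 performs a read at the given tick of counter 42.
   Assumptions: start time 0, inclusive end time, period 3, phase 0 to 1. */
static int port0_reads_at(unsigned tick)
{
    const unsigned period = 3;
    const unsigned end_time = 3u * BLEN * MAXK + 3u;
    if (tick > end_time)
        return 0;
    return (tick % period) == 0;
}

/* Byte offset from the port's initial address for the n-th read,
   given the address increment of 4 bytes per transfer. */
static unsigned port0_offset(unsigned n)
{
    return 4u * n;
}
```

With BLEN = 64 and MAXK = 16 the last permitted tick is 3*64*16+3 = 3075, which is itself a multiple of 3.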

[0085] Lines 9 to 14 establish MATs and BATs for the burst buffers transfers, tying entries in these tables to addresses
in main memory 3 and burst buffers memory 5. The command BIQ_SET_MAT(0, &b[0], BLEN*4, 4, TRUE) is a code
macro that is expanded into BB_SET_MAT(0, &b[0], BLEN*4, 4) and ties entry 0 in the MAT to address &b[0], sets the
burst length to be BLEN*4 bytes (that is, BLEN integers, if an integer is 32 bits) and the stride to 4. The two lines that
follow are similar and relate to c and a. The line BIQ_SET_BAT(0, 0x0000, BLEN*4) is expanded to BB_SET_BAT(0,
0x0000, BLEN*4) and ties entry 0 of the BAT to address 0x0000 in the burst buffers memory 5. The two lines that follow
are again similar.

[0086] Up to this point, no computation has taken place; however, coprocessor controller 9 and burst buffers controller
7 have been set up. The loop nest at lines 15 to 38 is where the actual computation takes place. This loop is repeated
MAXK times, and each iteration operates on BLEN elements, giving a total of MAXK*BLEN elements processed. The
loop starts with a set of instructions CIQ_xxx sent to the coprocessor instruction queue 8 to control the activity of the
coprocessor 2 and coprocessor controller 9, followed by a set of instructions sent to the burst instruction queue 6
whose purpose is to control the burst buffers controller 7 and the burst buffers memory 5. The relative order of these
two sets is in principle unimportant, because the synchronisation between the different system elements is guaranteed
explicitly by the semaphores. It would even be possible to have two distinct loops running after each other (provided
that the two instruction queues were deep enough), or to have two distinct threads of control.

[0087] The CIQ_xxx lines are code macros that simplify the writing of the source code. Their meaning is the following:

CIQ_LXD(N) inserts N CC_LXS_DECREMENT instructions in the coprocessor instruction queue 8;

CIQ_SA(port, address) inserts a CC_CURRENT_PORT(port) and a CC_PORT_ADDRESS(address) instruction
in the coprocessor instruction queue 8;

CIQ_ST(cycleno) inserts a CC_EXECUTE_START(cycleno) instruction in order to let the coprocessor 2 execute
for cycleno ticks of counter 42; and

CIQ_XSI(N) inserts N CC_XSS_INCREMENT instructions in the coprocessor instruction queue 8.
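The expansion of the CIQ_xxx macros can be sketched as functions that append entries to a software model of the coprocessor instruction queue. Everything here is illustrative: the queue representation, its depth and the helper ciq_push are hypothetical, and the instruction names follow Table 2 (the description uses slightly different spellings for some of them).

```c
#include <stddef.h>

/* Instruction kinds used by the CIQ_xxx macros (names as in Table 2). */
enum cc_op {
    CC_CURRENT_PORT,
    CC_PORT_ADDRESS,
    CC_START_EXEC,
    CC_LX_DECREMENT,
    CC_XS_INCREMENT
};

struct cc_instr {
    enum cc_op op;
    unsigned param;
};

/* Hypothetical software stand-in for coprocessor instruction queue 8. */
#define CIQ_DEPTH 64
static struct cc_instr ciq[CIQ_DEPTH];
static size_t ciq_len = 0;

/* Append one instruction. This sketch silently drops instructions when
   the queue is full; real hardware would be expected to stall instead. */
static void ciq_push(enum cc_op op, unsigned param)
{
    if (ciq_len < CIQ_DEPTH) {
        ciq[ciq_len].op = op;
        ciq[ciq_len].param = param;
        ciq_len++;
    }
}

static void CIQ_LXD(unsigned n)
{
    while (n--)
        ciq_push(CC_LX_DECREMENT, 0);
}

static void CIQ_SA(unsigned port, unsigned address)
{
    ciq_push(CC_CURRENT_PORT, port);
    ciq_push(CC_PORT_ADDRESS, address);
}

static void CIQ_ST(unsigned cycleno)
{
    ciq_push(CC_START_EXEC, cycleno);
}

static void CIQ_XSI(unsigned n)
{
    while (n--)
        ciq_push(CC_XS_INCREMENT, 0);
}
```

In this model, one call each of CIQ_LXD(2), CIQ_SA, CIQ_ST and CIQ_XSI(1) places six instructions in the queue.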

[0088] The net effect of the code shown above is to:

synchronise with a corresponding loadburst on the LXS semaphore;

start the computation on coprocessor 2 for 3*BLEN ticks of counter 42; and

synchronise with a corresponding storeburst on the XSS semaphore.

[0089] The BIQ_xxx lines are again code macros that simplify the writing of the source code. Their meaning is as
follows:

BIQ_FLB(mate,bate) inserts a BB_LOADBURST(mate, bate, TRUE) instruction into the burst instruction queue 6;

BIQ_LXI(N) inserts N BB_LX_INCREMENT instructions in the burst instruction queue 6;

BIQ_FSB(mate,bate) inserts a BB_STOREBURST(mate, bate, TRUE) instruction into the burst instruction queue
6; and

BIQ_XSD(N) inserts N BB_XS_DECREMENT instructions in the burst instruction queue 6.

[0090] The net effect of the code shown above is to load two bursts from main DRAM memory 3 into burst buffers
memory 5, and then to increase the value of the LX semaphore 10 so that the coprocessor 2 can start its execution
as described above. In all iterations but the first one, the results of the computation of coprocessor 2 are then stored
back into main memory 3 using a storeburst instruction. It is not strictly necessary to wait for the second iteration to
store the result of the computation executed in the first iteration, but this enhances the parallelism between the
coprocessor 2 and the burst buffers memory 5.

[0091] The use of the two variables eo and not_eo is a mechanism used here to allow the double-buffering effect
described previously.
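The effect of eo on the burst buffer addresses can be illustrated with a small sketch. The function name buffer_address is hypothetical; the arithmetic follows the CIQ_SA calls in the code above, assuming BLEN = 64 so that one burst occupies BLEN*4 = 256 bytes.

```c
/* Assumed values: BLEN = 64 elements of 4 bytes each. */
enum { BLEN = 64, BURST_BYTES = BLEN * 4 };

/* Burst buffer address used by a stream (0, 1 or 2) in loop iteration k.
   Each stream owns two adjacent buffers; eo selects between them. */
static unsigned buffer_address(unsigned stream, unsigned k)
{
    unsigned eo = k & 0x1u;  /* 0 on even iterations, 1 on odd ones */
    return (2u * stream * BURST_BYTES) + BURST_BYTES * eo;
}
```

For stream 0 this alternates between 0x0000 and 0x0100, for stream 1 between 0x0200 and 0x0300, and for stream 2 between 0x0400 and 0x0500, which are the six BAT addresses set up before the loop.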

[0092] Lines 39 to 42 perform the last burst transfer to main memory 3 from burst buffers memory 5, compensating 
for the absence of a storeburst instruction in the first iteration of the loop body. 

[0093] The resulting timeline is as shown in Figure 6. Loadbursts are the first activity (as until these are completed,
coprocessor 2 remains stalled by the LX semaphore) [...] This process continues until the computation is completed.



Tool chain for computation 



[...] "JHDL - An HDL for Reconfigurable Systems" by Peter Bellows and Brad Hutchings,
Proceedings of the IEEE Symposium on FPGAs for Custom Computing Machines, April 1998. [...]

The toolchain has the function of converting conventional sequential code to code adapted specifically for [...]
components. The [...]:

a CHESS coprocessor configuration for execution of the computation;

a burst buffer schedule for moving data between the system memory and the burst buffer memory; and

a coprocessor controller [...]

The toolchain itself has two components. The first is a frontend, which takes C code as its input and provides
[...] these: the CHESS configuration, the burst schedule, and [...]
[0099] The main task of the frontend is to generate a graph which aptly describes the computation as it is to happen
in coprocessor 2. One of the main steps performed is value-based dependence analysis, as described in W. Pugh and
D. Wonnacott, "An Exact Method for Analysis of Value-based Array Data Dependences", University of Maryland,
Institute for Advanced Computer Studies - Dept. of Computer Science, University of Maryland, December 1993. The
output generated is a description of the dataflow to be implemented in the CHESS array and a representation of all
the addresses that need to be loaded in as inputs (via loadburst instructions) or stored to as outputs (via storeburst
instructions), and of the order in which data has to be retrieved from or stored to the main memory 3. This is the basis
upon which an efficient schedule for the burst buffers controller 7 will be derived.
[0100] If we assume, as an example, the C code for a 4-tap FIR filter:

    int i, j, src[], kernel[], dst[];
    for( i = 0; i < 1000; i++ )
        for( j = 0; j < 4; j++ )
            dst[i] = dst[i] + src[4+i-j]*kernel[j];

as the input to the frontend, the output, provided as a text file, will have the following form:
    loop: 0<=i<999                        #loop nest description
    loop: 0<=j<4
    16: str/0/0/20/                       #store instruction
        LOD:
        #Array:d[1/0/0] at line 11
    20: ldc/16/0/0/                       #load constant
    22: str/0/0/26/                       #store instruction, which
        LOD: 4 <= j                       #writes its outputs to main
        #Array:d[1/0/0] at line 13        #memory if 4<=j
    26: add/22/27/31/                     #addition
    27: lod/26/0/0/                       #load instruction, taking its inputs
        Dep(16): [0] [0] / Range: j <= 0  #from instruction 16 if j<=0
        Dep(22): [0] [1] / Range: 1 <= j  #from instruction 22 otherwise
        LID:
        #Array:d[1/0/0] at line 13
    31: mul/26/32/37/                     #multiplication
    32: lod/31/0/0/                       #load instruction
        Dep(32): [1] [1] / Range: 1 <= i && 1 <= j
        LID: i <= 0 || j <= 0 && 1 <= i   #which takes its inputs from main
        #Array:src[1/-1/0] at line 13     #memory if i <= 0 || j <= 0 && 1 <= i
    37: lod/31/0/0/
        Dep(37): [1] [0] / Range: 1 <= i  #load instruction
        LID: i <= 0                       #taking its inputs from main memory if
        #Array:kernel[0/1/0] at line 13   #i<=0


[0101] This text file is a representation of an annotated graph. The graph itself is shown in Figure 7. The graph clearly
shows the dependencies found by the frontend algorithm. Edges 81 are marked with the condition under which a
dependence exists, and the dependence distance where applicable. The description provided contains all the
information necessary to generate a hardware component with the required functionality.

[0102] The backend of the compilation toolchain has certain basic functions. One is to schedule and retime the
extended dependence graph obtained from the frontend. This is necessary to obtain a fully functional CHESS
configuration. Scheduling involves determining a point in time for each of the nodes 82 in the extended dependence graph
to be activated, and retiming involves, for example, the insertion of delays to ensure that edges propagate values at
the appropriate moment. Scheduling can be performed using shifted-linear scheduling, a technique widely used in
hardware synthesis. Retiming is a common and quite straightforward task in hardware synthesis, and merely involves
adding an appropriate number of registers to the circuit so that different paths in the circuit meet at the appropriate
point in time. At this point, we have a complete description of the functionality of the coprocessor 2 (here, a CHESS
coprocessor). This description is shown in Figure 8. This description can then be passed on to the appropriate tools
to generate the sequence of signals (commonly referred to as "bitstream") necessary to program the CHESS
coprocessor with this functionality.
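The register insertion performed during retiming can be illustrated for a single node. This is a deliberately simplified sketch, not the toolchain's algorithm: it assumes the arrival time of each input is already known from scheduling, and the function name balance_inputs is hypothetical.

```c
#include <stddef.h>

/* For one node with n inputs, compute how many registers to insert on
   each input edge so that all values arrive at the same tick.
   arrival[i] is the tick at which input i becomes available;
   regs[i] receives the number of registers for that edge.
   Returns the tick at which all inputs are aligned. */
static unsigned balance_inputs(const unsigned *arrival, unsigned *regs, size_t n)
{
    unsigned latest = 0;
    for (size_t i = 0; i < n; i++) {
        if (arrival[i] > latest)
            latest = arrival[i];
    }
    for (size_t i = 0; i < n; i++) {
        regs[i] = latest - arrival[i];  /* delay the early paths */
    }
    return latest;
}
```

For inputs arriving at ticks 2, 5 and 3, the sketch inserts 3, 0 and 2 registers respectively, so that all three paths meet at tick 5.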

[0103] Another function required of the backend is generation of the burst buffer and coprocessor controller schedule.
Once the CHESS configuration has been obtained, it is apparent when it needs to be fed with values from main memory
and when values can be stored back to main memory, and the burst buffer schedule can be established. Accordingly,
a step is provided which involves splitting up the address space of all the data that needs to be loaded into or stored
from the burst buffers memory 5 into fixed bursts of data that the burst buffers controller 7 is able to act upon.

[0104] In the example presented, the input array (src[]) is split into several bursts of appropriate sizes, such that all
the address range needed for the algorithm is covered. This toolchain uses bursts of length B_len (where B_len is a
power of 2, and is specified as an execution parameter to the toolchain) to cover as much of the input address space
as possible. When no more can be achieved with this burst length, the toolchain uses bursts of decreasing lengths:
B_len/2, B_len/4, ..., 2, 1, until every input address needed for the algorithm belongs to one and only one burst.
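The partitioning described above can be sketched as a greedy cover of a contiguous address range. This is an illustration under stated assumptions rather than the toolchain's actual procedure: addresses are counted in elements, the range is contiguous, and the names struct burst and partition_bursts are hypothetical.

```c
#include <stddef.h>

struct burst {
    unsigned start;   /* first element covered by the burst */
    unsigned length;  /* number of elements in the burst */
};

/* Cover `count` elements beginning at `start` with bursts of length
   b_len (assumed to be a power of 2), then b_len/2, b_len/4, ..., 1.
   Writes at most max_out bursts to `out` and returns how many were
   written; if max_out is too small the cover is incomplete. */
static size_t partition_bursts(unsigned start, unsigned count, unsigned b_len,
                               struct burst *out, size_t max_out)
{
    size_t n = 0;
    for (unsigned len = b_len; len >= 1 && count > 0; len /= 2) {
        while (count >= len && n < max_out) {
            out[n].start = start;
            out[n].length = len;
            n++;
            start += len;
            count -= len;
        }
    }
    return n;
}
```

For example, 1000 elements with B_len = 64 are covered by fifteen bursts of 64, one of 32 and one of 8, so that every element belongs to exactly one burst.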



[0105] For each of these input bursts, the earliest point in iteration space in which any of the data loaded is needed
is computed. In other words, to each input burst there is associated one point in the iteration space for which it is
guaranteed that no earlier iterations need any of the data loaded by the burst. It is easy to detect when the execution
of the coprocessor 2 would reach that point in the iteration space. There are thus created:

a loadburst instruction for the relevant addresses, in order to move data into burst buffer memory 5; and

a corresponding synchronisation point (a CC_LX_DECREMENT / BB_LX_INCREMENT pair) to guarantee that
the execution of coprocessor 2 is synchronised with the relevant loadburst instruction.

[0106] To achieve an efficient overlap of computation and communication, the loadburst instruction has to be issued
in advance, in order to hide the latency associated with the transfer of data over the bus.

[0107] The output space of the algorithm is partitioned into output bursts, according to a similar logic. Again, the
output space is partitioned into bursts of variable length.

[0108] The toolchain creates: 

a storeburst instruction for the relevant addresses; and

a corresponding synchronisation point (BB_XS_DECREMENT / CC_XS_INCREMENT pair).

[0109] At this point, we possess information relevant to:

the relative ordering of loadburst and storeburst instructions, and the parameters of execution (addresses, etc.); and

their position relative to the computation to be performed on coprocessor 2.

[0110] This information is used to generate appropriate C code to organise the overall computation.

[0111] The actual code generation phase (that is, the emission of the C code to run on processor 1) can be
accomplished using the code generation routines contained in the Omega Library of the University of Maryland, available at
[...], followed by a customised script that [...]

Experimental Results - Image Convolution

[0112] An image convolution algorithm is described by the following loop nest: 

    for (i=0; i<IMAGE_HEIGHT; i++)
        for (j=0; j<IMAGE_WIDTH; j++)
            for (k=0; k<KERNEL_HEIGHT; k++)
                for (l=0; l<KERNEL_WIDTH; l++)
                    Dest[i,j] += Source[(i+1)-k,(j+1)-l]*C[k,l];

[0113] It is necessary to extend the source image by KERNEL_HEIGHT-1 pixels in the vertical direction
and KERNEL_WIDTH-1 pixels in the horizontal direction in order to simplify boundary conditions. Two kernels are used
in evaluating system performance: a 3x3 kernel and a 5x5 kernel, both performing median filtering.

[0114] Figures 9 and 10 illustrate the performance of the architecture according to an embodiment of the invention
(indicated as BBC) as against a conventional processor using burst buffers (indicated as BB) and a conventional
processor-and-cache combination (indicated as Cache). Two versions of the algorithm were implemented, one with
32-bit pixels and one with 8-bit pixels. The same experimental measurements were taken for different image sizes,
ranging from 8x8 to 128x128, and for different burst lengths.

[0115] As can be seen from the Figures, the BBC implementation showed a great performance advantage over the
BB and the Cache implementations. The algorithm is relatively complex, and the overall performance of the system in
both BB and Cache implementations is heavily compute-bound - the CPU simply cannot keep up because of the high
complexity of the algorithm. Using embodiments of the invention, in which the computation is vastly more effective as
it is carried out on the CHESS array (with its inherent parallelism), the performance is if anything IO-bound - even
though IO is also efficient through effective use of burst buffers. Multimedia instructions (such as MIPS MDMX) could
improve the performance of the CPU in the BB or the Cache implementations, as they can allow for some parallel
execution of arithmetic instructions. Nonetheless, the resulting performance enhancement is unlikely to reach the
performance levels obtained using a dedicated coprocessor in this arrangement.

Modifications and variations

[0116] The function of decoupling the processor 1 from the coprocessor 2 and the burst buffer memory 5 can be 
achieved by means other than the instruction queues 6,8. An effective alternative is to replace the two queues with 
two small processors (one for each queue) fully dedicated to issuing instructions to the burst buffers memory 5 and 
the coprocessor 2, as described in Figure 12. The burst instruction queue is replaced (with reference to the Figure 1 
embodiment) by a burst command processor 106, and the coprocessor instruction queue is replaced by a coprocessor 
command processor 108. Since this would be the only task carried out by these two components, there would be no 
need for them to be decoupled from the coprocessor 2 and the burst buffers 7 respectively. Each of the command 
processors 106, 108 could operate by issuing a command to the coprocessor or burst buffers (as appropriate), and 
then do nothing until that command has completed its execution, then issue another command, and so on. This would 
complicate the design, but would free the main processor 1 from its remaining trivial task of issuing instructions into 
the queues. The only work to be carried out by processor 1 would then be the initial setting up of these two processors, 
which would be done just before the beginning of the computation. During the computation, the processor 1 would thus 
be completely decoupled from the execution of the coprocessor 2 and the burst buffers memory 5. 

[0117] Two conventional, but smaller, microprocessors (or, alternatively, only one processor running two independent
threads of control) could be used, each one of them running the relevant part of the appropriate code (loop nest). 
Alternatively, two general state machines could be synthesised whose external behaviour would reflect the execution 
of the relevant part of the code (that is, they would provide the same sequence of instructions). The hardware complexity 
and cost of such state machines would be significantly smaller than that of the equivalent dedicated processors. Such 
state machines would be programmed by the main processor 1 in a way similar to that described above. The main 
difference would be that the repetition of events would be encoded as well: this is necessary for processor 1 to be able 
to encode the behaviour of one algorithm in a few (if complex) instructions. In order to obtain the repetition of an event 
x times, the processor 1 would not have to insert x instructions in a queue, but would have to encode this repetition 
parameter in the instruction definition. 

[0118] As indicated above, a particularly effective mechanism is for finite state machines (FSMs) to be used instead 
of queues to decouple the execution of the main processor 1 from the execution of coprocessor 2 and the burst buffers 
controller 7. This mechanism will now be discussed in further detail. 

[0119] In the architecture illustrated in Figure 1, instructions to drive the execution of different I/O streams can be
mixed with instructions for execution of coprocessor 2. This is possible because the mutual relationships between
system components are known at compile time, and therefore instructions to the different system components can be
interleaved in the source code in the correct order.

[0120] Two state machines can be built to issue these instructions for execution in much the same way. One such
state machine would control the behaviour of the coprocessor 2, issuing CC_xxx_xxx instructions as required, and the
other would control the behaviour of burst buffers controller 7, issuing BB_xxx_xxx instructions as required.

[0121] Such state machines could be implemented in a number of different ways. One alternative is indicated in
Figure 13. With reference to the vector addition example presented above, this state machine 150 (for the coprocessor
2, though the equivalent machine for the burst buffers controller 7 is directly analogous) implements a sequence of
instructions built from the pattern:

    CC_LX_DECREMENT,
    CC_LX_DECREMENT,
    CC_START_EXEC,
    CC_XS_INCREMENT

[0122] The main state machine 150 is effectively broken up into simpler state machines 151, 152, 153, each of which
controls the execution of one kind of instruction and has a period and a phase defined relative to an event counter 154.
When the execution of the instruction is completed, the event counter 154 is incremented again. This sequence of
events can be summarised as:

1: Increment event counter: EC++
2: Choose state machine i for execution if there exists an M such that M*Period_i + Phase_i = EC
3: If such a state machine has been found, execute instruction described by state machine i (this may include [...])
4: Go back to 1
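
Step 2 of this sequence can be sketched as a search over the simple state machines. The structure and function names below are hypothetical, and the sketch assumes that M is a non-negative integer and that a period of 0 is used to stand for a machine that fires only once (an "infinite" period).

```c
/* Hypothetical description of one simple state machine. */
struct simple_fsm {
    unsigned period;  /* 0 is used here to mean "fires only once" */
    unsigned phase;
};

/* Returns the index of the first machine i for which some integer
   M >= 0 satisfies M*period_i + phase_i == ec, or -1 if none does. */
static int select_fsm(const struct simple_fsm *m, int count, unsigned ec)
{
    for (int i = 0; i < count; i++) {
        if (m[i].period == 0) {
            if (ec == m[i].phase)
                return i;
        } else if (ec >= m[i].phase && (ec - m[i].phase) % m[i].period == 0) {
            return i;
        }
    }
    return -1;
}
```

With four machines of period 4 and phases 0, 1, 2 and 3, successive values of the event counter select them in turn, reproducing the four-instruction pattern given above.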

[0123] The parameters for the execution of an instruction (addresses to read from/write to, length of execution for a
CC_START_EXEC, etc.) will have to be encoded in the state machine 150. It should also be noted that
more than one state machine can issue a given instruction, typically with different parameters.

[0124] This approach is particularly well suited to periodic behaviour. However, if an event has to happen only
once, it can readily be encoded in a simple state machine with infinite period and finite phase, the only consequence
being that this simple state machine will be used only the once.

[0125] This approach can itself be varied. For example, to add flexibility to the mechanism, a possible option is to
add start time and end time parameters to the simple state machines, in order to limit the execution of one or more
simple state machines to a predetermined "time window".

[0126] The programming of these state machines would happen during the initialisation of the system, for example
through the use of memory-mapped registers assigned by the processor 1. An alternative would be the loading of all
the parameters necessary to program these state machines from a predefined region of main memory 3, perhaps
through the use of a dedicated channel and a Direct Memory Access (DMA) mechanism.

[0127] The other alternative mechanism suggested, of using two dedicated microprocessors, would require no
significant modification to the programming model for the architecture of Figure 1: the same techniques used to program
the processor 1 could be used, with an additional step of splitting the commands intended for the coprocessor
controller 9 from those intended for the burst buffers controller 7. Although feasible, this arrangement may be
disadvantageous in some respects: it would be necessary for these processors to be provided with access to main memory
3 or other DRAM, adding to the complexity of the system. The cost and complexity of the system would also be increased
by underutilising the microprocessors, in that they are only present to perform very simple computations.

[0128] Various developments beyond the architecture of Figure 1 and its alternatives can also be made without
departing from the essential principles of the invention. Three such areas of development will be described below:
pipeline architectures, data dependent conditionals/unknown execution times, and non-affine accesses to memory.

[0129] Pipeline architectures have value where applications require more than one transformation to be carried out
on their input data streams: for instance, a convolution may be followed immediately by a correlation. In order to
accommodate this kind of arrangement, changes to both the architecture and the computational model will be required.
Architecturally, successive buffered CHESS arrays could be provided, or a larger partitioned CHESS array, or a CHESS
[...]

[0130] Figure 11A shows an arrangement with a staggered CHESS/burst buffer pipeline instructed from a processor 143
and exchanging data with a main memory 144, where a CHESS array receives data from a first set of burst buffers 142
and passes it to a second set of burst buffers 145, this second set of burst buffers 145 interacting with a further
CHESS array 146 (potentially this pipeline could be continued with further sets of CHESS arrays and burst buffers).
Synchronisation becomes more complex, and involves communication between adjacent CHESS arrays and between
adjacent sets of burst buffers, but the same general principles can be followed to allow efficient use of burst buffers,
and efficient synchronisation between CHESS arrays, so as to guarantee correctness of the computation carried out
by successive stages of the pipeline.

[0131] Figure 11B shows a different type of computational pipeline, with an SRAM cache 155 between two CHESS
arrays 151, 156, with loads provided to a first set of burst buffers 152 and stores provided by a second set of burst
buffers 157. The role of the processor 153 and of the main memory 154 is essentially unchanged from other
embodiments. Synchronisation may be less difficult in this arrangement, although the arrangement may also exploit
parallelism less effectively.

[0132] One constraint on efficient use of the coprocessor in an architecture as described above is that the execution
time of the coprocessor implementation should be known (to allow efficient scheduling). This is achievable for many 
media-processing loops. However, if execution times are unknown at compile time, then the scheduling requirements 
in the toolchain need to be relaxed, and appropriate allowances need to be made in the synchronisation and commu- 
nication protocols between the processor, the coprocessor and the burst buffers. The coprocessor controller also will 
need specific configuration for this circumstance. 

[0133] Another extension is to allow non-affine references to burst buffer memory. In the burst buffers model used
above, all access is of the type AI+F, where A is a constant matrix, I is the iteration vector and F is a constant vector.
Use of this limited access model allows the coprocessor controller and the processor to know in advance what data
will be needed at any given moment in time, allowing efficient creation of logical streams. The significance of this to
the architecture as a whole is such that it is unclear how non-affine access could be provided in a completely arbitrary
way (the synchronisation mechanisms would appear to break down), but it would be possible to use non-affine array
accesses to reference lookup tables. This could be done by loading lookup tables into burst buffers, and then allowing
the coprocessor to generate a burst buffer address relative to the start of the lookup table for subsequent access. It
would be necessary to ensure that such addresses could be generated sufficiently far in advance of the time that they
will be used (possibly this could be achieved by a refinement to the synchronisation mechanism) and to modify the
logical stream mechanism to support this type of recursive reference.
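
By way of illustration only, the affine access model of the preceding paragraph can be sketched as follows. The sketch enumerates the addresses AI+F over a small loop nest and, separately, forms a lookup-table address relative to the start of a table. The matrix, offset, loop bounds and function names are hypothetical and chosen purely for illustration; they do not describe the controller of any embodiment.

```python
from itertools import product


def affine_addresses(A, F, bounds):
    """Yield (I, A*I + F) for every iteration vector I of a loop nest.

    A is a constant matrix (list of rows), F a constant vector, and
    bounds gives the trip count of each loop, outermost first.
    All three are illustrative assumptions.
    """
    for I in product(*(range(n) for n in bounds)):
        addr = tuple(
            sum(a * i for a, i in zip(row, I)) + f
            for row, f in zip(A, F)
        )
        yield I, addr


def lookup_address(table_base, index, table_len):
    """Return a buffer address relative to the start of a lookup table."""
    if not 0 <= index < table_len:
        raise IndexError("index outside lookup table")
    return table_base + index


# Hypothetical example: a 2 x 3 block of a row-major array with 8 words
# per row, starting at word offset 16, so address = 8*i + 1*j + 16.
A = [[8, 1]]
F = [16]
addresses = [addr[0] for _, addr in affine_addresses(A, F, (2, 3))]
```

Because A, F and the loop bounds are constants, the whole address sequence can be computed before the loop executes, which is the property that permits data to be requested in advance. The index passed to `lookup_address`, by contrast, depends on computed data and is only known once the coprocessor has produced it.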

[0134] Many variations and extensions to the architecture of Figure 1 can thus be carried out without deviating from
the invention as claimed. 



Claims 

1. A computer system, comprising:

a first processor;
a second processor for use as a coprocessor to the first processor;
a memory; and
a decoupling element;

wherein instructions are passed to the second processor from the first processor through the decoupling element,
such that the second processor consumes instructions derived from the first processor through the decoupling
element, and wherein the second processor receives data from and writes data to the memory, whereby the
processing of instructions by the second processor is decoupled from the operation of the first processor.

2. A computer system as claimed in claim 1, wherein the decoupling element is a coprocessor instruction queue,
wherein instructions are added to the coprocessor instruction queue by the first processor and consumed from the
coprocessor instruction queue by the coprocessor.

3. A computer system as claimed in claim 1, wherein the decoupling element is a state machine, wherein information
to provide instructions to the second processor is provided to the state machine by the first processor, and
instructions are provided in an ordered sequence to the second processor by the state machine.

4. A computer system as claimed in claim 1, wherein the decoupling element is a third processor, wherein information
to provide instructions to the second processor is provided to the third processor by the first processor, and
instructions are provided in an ordered sequence to the second processor by the third processor.

5. A computer system as claimed in any preceding claim, further comprising a coprocessor controller for controlling
the activity of the second processor and for synchronising the execution of the coprocessor with loads from memory. 

6. A computer system as claimed in any preceding claim, wherein the second processor is configurable. 
8. A computer system as claimed in any preceding claim, wherein the first processor is able to switch tasks during
execution of instructions by the second processor.
[...] processing of memory instructions by the buffer memory is decoupled from the operation of the first processor.

12. A computer system as claimed in claim 11, wherein the second decoupling element is a buffer memory instruction
queue, wherein memory instructions are added to the buffer memory instruction queue by the first processor and
consumed from the buffer memory instruction queue by the buffer memory.

13. A computer system as claimed in claim 11, wherein the second decoupling element is a state machine, wherein
information to provide memory instructions to the buffer memory is provided to the state machine by the first
processor, and memory instructions are provided in an ordered sequence to the buffer memory by the state machine.

14. A computer system as claimed in claim 11, wherein the second decoupling element is a fourth processor, wherein
information to provide memory instructions to the buffer memory is provided to the fourth processor by the first
processor, and memory instructions are provided in an ordered sequence to the buffer memory by the fourth processor.

15. A computer system as claimed in any of claims 9 to 14, further comprising a synchronisation mechanism to
synchronise transfer of data between the buffer memory and the memory with execution of instructions by the second
processor.



16. A computer system as claimed in claim 15, wherein the synchronisation mechanism is adapted to block execution
of instructions by the second processor on data which has not yet been loaded to the buffer memory from the
memory, and [...] of data from the [...]
memory where relevant instructions have not yet been executed by the second processor.



17. A computer system as claimed in claim 16, adapted such that when execution of instructions or memory instructions
is blocked by the synchronisation mechanism, other instructions or memory instructions which are not blocked by
the synchronisation mechanism may be carried out.

18. A computer system as claimed in any previous claim, wherein the first processor is the central processing unit of
a computer device.

19. A method of operating a computer system, comprising:
providing code for execution by a first processor;
[...] code of a task to be carried out by a second processor acting as coprocessor to the first
processor;
passing information defining the task from the first processor to a decoupling element; 
passing instructions derived from said information from the decoupling element to the second processor; and
executing said instructions on the second processor, wherein the processing of said instructions by the second
processor is decoupled from the operation of the first processor.
Figure 6